Adaptive Information Extraction from Text by Rule Induction and Generalisation
Abstract
(LP)2 is a covering algorithm for adaptive Information Extraction (IE) from text. It induces symbolic rules that insert SGML tags into texts by learning from examples found in a user-defined tagged corpus. Training is performed in two steps: initially a set of tagging rules is learned; then additional rules are induced to correct mistakes and imprecision in tagging. Induction is performed by bottom-up generalization of examples in the training corpus. Shallow Natural Language Processing (NLP) knowledge is used in the generalization process. The algorithm has a considerable track record. From a scientific point of view, experiments report excellent results with respect to the current state of the art on two publicly available corpora. From an application point of view, a successful industrial IE tool has been based on (LP)2. Real-world applications have been developed, and licenses have been released to external companies for building other applications. This paper presents (LP)2, its experimental results and applications, and discusses the role of shallow NLP in rule induction.
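To make the idea of a covering algorithm with bottom-up generalization more concrete, here is a minimal Python sketch. It is not the (LP)2 implementation: the rule representation (a window of words to the left of a tag insertion point), the wildcard marker, the function names, the `window` and `max_error` parameters, and the toy corpus are all assumptions made for illustration. The second training step (correction rules) and the use of shallow NLP information are left out.

```python
from itertools import product

WILDCARD = None  # hypothetical marker: matches any token


def matches(pattern, tokens, pos):
    """True if `pattern` matches the len(pattern) tokens ending just before `pos`."""
    start = pos - len(pattern)
    if start < 0:
        return False
    return all(p is WILDCARD or p == t for p, t in zip(pattern, tokens[start:pos]))


def generalisations(seed):
    """Yield every pattern obtained by relaxing a subset of the seed's word constraints."""
    for mask in product([False, True], repeat=len(seed)):
        yield tuple(WILDCARD if relax else w for relax, w in zip(mask, seed))


def score(pattern, corpus):
    """Count the correct and wrong insertion points that `pattern` matches."""
    correct = wrong = 0
    for tokens, gold in corpus:
        for pos in range(len(tokens) + 1):
            if matches(pattern, tokens, pos):
                if pos in gold:
                    correct += 1
                else:
                    wrong += 1
    return correct, wrong


def induce_rules(corpus, window=2, max_error=0.0):
    """Covering loop: take an uncovered example, generalise it, keep the best rule."""
    uncovered = {(i, pos) for i, (_, gold) in enumerate(corpus) for pos in gold}
    rules = []
    while uncovered:
        i, pos = min(uncovered)
        tokens = corpus[i][0]
        # The seed rule is the fully specified left context of the example.
        seed = tuple(tokens[max(0, pos - window):pos])
        best, best_correct = seed, 0
        for pattern in generalisations(seed):
            correct, wrong = score(pattern, corpus)
            # Every generalisation still matches the seed example, so correct >= 1.
            if wrong / (correct + wrong) <= max_error and correct > best_correct:
                best, best_correct = pattern, correct
        rules.append(best)
        # Remove every example the chosen rule covers, then continue.
        uncovered = {(j, p) for (j, p) in uncovered
                     if not matches(best, corpus[j][0], p)}
    return rules


# Toy corpus (invented): each entry is (tokens, positions where a tag is inserted).
corpus = [
    ("the seminar at 4 pm will".split(), {3}),
    ("the talk at 3 pm in".split(), {3}),
    ("the room is large".split(), set()),
]
rules = induce_rules(corpus)
print(rules)
```

On this toy corpus the seed context ("seminar", "at") is relaxed to a rule that only requires the word "at" immediately before the insertion point, which covers both tagged examples without matching any untagged position. Relaxing both words is rejected because it would also fire at untagged positions.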
Similar resources
(LP)2, an Adaptive Algorithm for Information Extraction from Web-related Texts
(LP)2 is an algorithm for adaptive Information Extraction from Web-related text that induces symbolic rules by learning from a corpus tagged with SGML tags. Induction is performed by bottom-up generalisation of examples in a training corpus. Training is performed in two steps: initially a set of tagging rules is learned; then additional rules are induced to correct mistakes and imprecision in ta...
(LP)2: Rule Induction for Information Extraction Using Linguistic Constraints
Machine learning has been widely used in information extraction from texts in the last years. Two directions of research can be identified: wrapper induction (WI) and NLP-based methodologies. WI techniques have historically made scarce use of linguistic information and their application is mainly limited to rigidly structured documents. NLP-based methodologies tend to be brittle when linguistic...
Information Extraction from News Video using Global Rule Induction Technique
Global rule induction technique has been successfully used in information extraction (IE) from text documents. In this paper, we employ global rule induction technique to perform information extraction from news video documents. We divide our framework into two levels: shot; and story levels. We use a hybrid algorithm to classify each input video shot into one of the predefined genre types and ...
INDUCING VALUABLE RULES FROM IMBALANCED DATA: THE CASE OF AN IRANIAN BANK EXPORT LOANS